We load the Census 2011 Urban Agglomerations (Class-I) data.
library(tidyverse)
data <- read_csv("../data/ua_populations.csv")
data## # A tibble: 2,777 x 9
## ua_no ua year area population pop_change pop_change_perc~ pop_male
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Grea~ 1961 540. 4515495 1298591 40.4 2694748
## 2 1 Grea~ 1971 560. 6596370 2080875 46.1 3820973
## 3 1 Grea~ 1981 588. 9421962 2825592 42.8 5290416
## 4 1 Grea~ 1991 1041. 12596243 3174281 33.7 6891222
## 5 1 Grea~ 2001 1135. 16434386 3838143 30.5 9021789
## 6 1 Grea~ 2011 1063. 18394912 1960526 11.9 9872271
## 7 2 Delh~ 1961 327. 2359408 922274 64.2 1327386
## 8 2 Delh~ 1971 446. 3647023 1287615 54.6 2028091
## 9 2 Delh~ 1981 581. 5760811 2113788 58.0 3186119
## 10 2 Delh~ 1991 685. 8471625 2710814 47.1 4630456
## # ... with 2,767 more rows, and 1 more variable: pop_female <dbl>
A histogram will help us understand the overall distribution of population values. Let’s plot one.
Note how we have specified the figure height and width at the beginning of the R code chunk.
data_2011 <- data %>% filter(year == 2011)
ggplot(data_2011) + geom_histogram(aes(x = population), bins = 100)Not surprisingly, that was very skewed. There are a few mega cities with much larger population than most others. Plotting the population in log scale might be a useful thing in such cases. Add + scale_x_log10() to make such a plot. Adjust the number of bins if needed.
data_2011 <- data %>% filter(year == 2011)
ggplot(data_2011) + geom_histogram(aes(x = population), bins = 60) + scale_x_log10()Which cities had the largest population in 2011? As we saw earlier, a bar plot will be ideal to show values for several items. Let’s use such a plot to show the populations of the top 30 cities. Would you plot the ua names on the x axis or on the y axis? Try both if you’re unsure.
## Fill in the geom_bar aesthetics correctly
data %>%
filter(year == 2011) %>%
top_n(30, population) %>% # top n by population
mutate(ua = factor(ua)) %>%
mutate(ua = reorder(ua, population)) %>% # Order the factor according to population
ggplot() +
geom_bar(aes(x = population, y = ua), stat = "identity") +
scale_x_continuous(label = scales::comma)If you prefer comma separated numbers (e.g. 10,000) on the y axis instead of the scientific notation (e.g. 1.0e+05) you could add + scale_x_continuous(label = scales::comma). If the scales package is not installed, you will need to install it using install.packages("scales").
Let’s plot a line chart showing population growth over time.
filtered_uas <- data %>%
filter(ua == "Delhi U.A." | ua == "Greater Mumbai U.A." )
## Fill the correct aesthetics:
ggplot(filtered_uas) + geom_line(aes(x = year, y = population, color = ua))Add a few more cities to the plot above. If you have a long list, instead of chaining each one with a ua == condition, the filter can use %in% operator: filter(ua %in% c("city A", "city B", "city C"))
filtered_uas <- data %>%
filter(ua %in% c("Delhi U.A.", "Greater Mumbai U.A.", "Kolkata U.A.", "Bruhat Bangalore U.A.", "Hyderabad U.A. @", "Chennai U.A. $"))
## Fill the correct aesthetics:
ggplot(filtered_uas) +
geom_line(aes(x = year, y = population, color = ua)) +
geom_point(aes(x = year, y = population))Which UA’s had the highest growth in population during 1991-2011? To answer this, we need to measure growth relative to their respective base population. In other words, we have to divide the population column by the 1991 population value for the corresponding city. To do this, we need to first create a column with the 1991 populations.
pop_1991 <- data %>%
filter(year == 1991) %>%
select(ua_no, population) %>%
rename(population_91 = population)
pop_1991## # A tibble: 467 x 2
## ua_no population_91
## <dbl> <dbl>
## 1 1 12596243
## 2 2 8471625
## 3 3 11110314
## 4 4 5416903
## 5 5 4137314
## 6 6 4344437
## 7 7 3631233
## 8 8 2493987
## 9 9 1514545
## 10 10 1518235
## # ... with 457 more rows
That gives us the 1991 populations. Now we need to add (or merge, or join) this column to the orginal data frame. dplyr join operation is what we need here. Run the command vignette("two-table") on the console for help.
data %>%
inner_join(pop_1991, by = "ua_no")## # A tibble: 2,776 x 10
## ua_no ua year area population pop_change pop_change_perc~ pop_male
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Grea~ 1961 540. 4515495 1298591 40.4 2694748
## 2 1 Grea~ 1971 560. 6596370 2080875 46.1 3820973
## 3 1 Grea~ 1981 588. 9421962 2825592 42.8 5290416
## 4 1 Grea~ 1991 1041. 12596243 3174281 33.7 6891222
## 5 1 Grea~ 2001 1135. 16434386 3838143 30.5 9021789
## 6 1 Grea~ 2011 1063. 18394912 1960526 11.9 9872271
## 7 2 Delh~ 1961 327. 2359408 922274 64.2 1327386
## 8 2 Delh~ 1971 446. 3647023 1287615 54.6 2028091
## 9 2 Delh~ 1981 581. 5760811 2113788 58.0 3186119
## 10 2 Delh~ 1991 685. 8471625 2710814 47.1 4630456
## # ... with 2,766 more rows, and 2 more variables: pop_female <dbl>,
## # population_91 <dbl>
You can now see the population_91 column added to the data (you may have to scroll to the right).
Now we divide the population column by the base population column.
relative_pop <- data %>%
inner_join(pop_1991, by = "ua_no") %>%
filter(year >= 1991) %>%
mutate(population = population / population_91) For plotting, let us choose only in the top 10 cities that grew the fastest between 1991 and 2011.
fastest_growing <- relative_pop %>%
filter(year == 2011) %>%
top_n(10, population)Now we have to filter the data to include the rows only for these top-10 cities. As we don’t have a list of names with us, we can’t easily do a filter() here. semi_join() comes to our help. It is identical to inner_join, but does not include any columns from the second data set.
relative_pop <- relative_pop %>%
semi_join(fastest_growing, by = "ua_no")
## Fill in the aesthetics
ggplot(relative_pop) +
geom_line(aes(x = year, y = population, colour = ua)) +
geom_point(aes(x = year, y = population)) # Add points on top of the lineThe data gives percentage change in population. What story does that tell us? For each city, let us calculate the average percentage change in population across the years?
change <- data %>%
group_by(ua_no, ua) %>%
summarise(average_change = mean(pop_change_percent, na.rm = TRUE)) %>%
ungroup() %>% # For top_n to work correctly, we need to undo the grouping
top_n(30, average_change)
change## # A tibble: 30 x 3
## ua_no ua average_change
## <dbl> <chr> <dbl>
## 1 14 Ghaziabad U.A. 95.5
## 2 25 Malappuram U.A. 118.
## 3 27 Kannur U.A. 79.1
## 4 34 Faridabad 87.8
## 5 40 Vasai-Virar City [9] 96.3
## 6 50 Durg-Bhilainagar U.A. 139.
## 7 57 Guwahati U.A. 79.3
## 8 62 Gurgaon U.A. 103.
## 9 64 Bhubaneswar U.A. 99.6
## 10 67 Bhiwandi U.A. 78.8
## # ... with 20 more rows
Plot this average_change values in this data as a bar graph in the same way we plotted the populations in the first section.
ggplot(change) +
geom_bar(aes(x = average_change, y = reorder(ua, average_change)), stat = "identity")Now, let’s create a bar plot of the sex ratios of the 30 biggest cities in 2011.
filtered_uas <- data %>%
mutate(sex_ratio = 1000 * pop_female / pop_male) %>% ## Fill this in. Females per 1000 males
filter(year == 2011) %>% ## Add a filter for the year
mutate(ua = reorder(ua, sex_ratio)) %>%
top_n(30, population)
## Add the aesthetics
ggplot(filtered_uas) + geom_bar(aes(x = sex_ratio, y = ua), stat = "identity")Okay, that wasn’t bad. But the black bars starting from 0, taking up a lot of our plot area doesn’t look great. How could we fix this? Maybe we can centre the values around 1000? 1000 is the “normal” one would expect, and the plot will show the deviation from it.
Note that the last line uses a function to correct the x axis tick labels. As we subtracted 1000 from the sex ratio value earlier, we are adding it back, so that the original values appear on the plot. Such functions without a name are called anonymous functions. They are very useful for doing such conversions.
filtered_uas <- filtered_uas %>%
mutate(sex_ratio = 1000 * pop_female / pop_male - 1000)
ggplot(filtered_uas) +
geom_bar(aes(x = sex_ratio, y = ua), stat = "identity") + ## Fill in the aesthetics
scale_x_continuous(labels = function(label){label + 1000})If you see the sex ratio dinosaur, you got it right! :)
How can we understand the population growth pattern of the cities that had the largest decrease in sex ratios between 1991 and 2011? In the UA data wrangling practice sheet, we computed the change in sex ratio. Let us build on that to produce some visualisation.
## See the UA data wrangling Rmd for more details
sr_change <- data %>%
mutate(sex_ratio = pop_female / pop_male * 1000) %>%
filter(year == 2011 | year == 1991) %>%
select(ua_no, ua, year, sex_ratio) %>%
pivot_wider(names_from = year, values_from = sex_ratio) %>%
mutate(difference = `2011` - `1991`)
sr_change_bottom_20 <- sr_change %>% top_n(-20, difference)Now we create a long format of the data. Plots to show time series become easier with this format.
long_data <- data %>%
rename(total = population, male = pop_male, female = pop_female) %>%
select(-c(area, pop_change, pop_change_percent)) %>%
pivot_longer(cols = c(male, female), names_to = "sex", values_to = "population")
long_data## # A tibble: 5,554 x 6
## ua_no ua year total sex population
## <dbl> <chr> <dbl> <dbl> <chr> <dbl>
## 1 1 Greater Mumbai U.A. 1961 4515495 male 2694748
## 2 1 Greater Mumbai U.A. 1961 4515495 female 1820747
## 3 1 Greater Mumbai U.A. 1971 6596370 male 3820973
## 4 1 Greater Mumbai U.A. 1971 6596370 female 2775397
## 5 1 Greater Mumbai U.A. 1981 9421962 male 5290416
## 6 1 Greater Mumbai U.A. 1981 9421962 female 4131546
## 7 1 Greater Mumbai U.A. 1991 12596243 male 6891222
## 8 1 Greater Mumbai U.A. 1991 12596243 female 5705021
## 9 1 Greater Mumbai U.A. 2001 16434386 male 9021789
## 10 1 Greater Mumbai U.A. 2001 16434386 female 7412597
## # ... with 5,544 more rows
Since we want to show the time pattern for each city separately, we use facet_wrap(). It creates a grid of similar plots for every city (as we are faceting by the ua variable). This plot brings out the widening gap between the male and female populations in some of these cities. Among these, are there cities you are familiar with? How would you explain this pattern? Is migration playing a role?
long_data %>%
semi_join(sr_change_bottom_20, by = "ua_no") %>%
ggplot() +
geom_line(aes(x = year, y = population, colour = sex)) + ## Fill in x and y
facet_wrap(~ua, scales = "free_y") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) ## Rotate the labels so they don't overlap. Try commenting this out.Now we create a scatter plot summarising the population, population density and sex ratios of the UAs in 2011. We use the log of population since it has a very skewed distribution.
cities_scatter <- data %>%
filter(year == 2011) %>% ## Filter for year
mutate(
sex_ratio = 1000 * pop_female / pop_male, ## Fill this
pop_density = population / area## Fill this
) %>%
ggplot(aes(x = log10(population), y = pop_density, label = ua)) + # WHat should the label be?
geom_point(aes(size = log10(population),
colour = sex_ratio,
alpha = 0.5)) + # alpha sets the transparency level
scale_color_gradient(low = "red", high = "green") +
scale_y_reverse() # reverse the y axis - denser points sink deeper
cities_scatterTo identify which point correspond to which city, we will need to label them. However, as there are hundreds of cities, it is not possible to have all the labels. So, we will have to add labels based on some condition. As an example, we show below how you label the points with extreme values for sex ratio and population density. Extreme values are identified using the quantile() function to get the 0.2 and 0.98 quantiles. In other words, 2nd and 98th percentile of the variables.
ggrepel package is useful in automatically adjusting the label positions in the plot so that they don’t overlap. Install the package if it is not there.
library(ggrepel)
cities_scatter_labelled <- cities_scatter +
geom_label_repel(aes(label = if_else(sex_ratio > quantile(sex_ratio, 0.98, na.rm = TRUE) |
sex_ratio < quantile(sex_ratio, 0.02, na.rm = TRUE) |
pop_density > quantile(pop_density, 0.98, na.rm = TRUE) |
pop_density < quantile(pop_density, 0.02, na.rm = TRUE),
ua, ''))) # If condition is true, use UA name, else empty label
cities_scatter_labelledPlotly is a library to create interactive visualisation. They have an API that follows the tidyverse philosophy. This book is a great resource to get started if you are interested. Plotly also provides a magic API to bring static ggplot plots to life. Here we will use only that.
Hover your mouse over the plot and check the tooltips. This way, we can manage without adding the labels.
plotly::ggplotly(cities_scatter)